All Questions
Tagged with bigdata, apache-spark
28 questions
0 votes
0 answers
14 views
Stuck on loading parquet files recursively of varying size with Spark
I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying sizes. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...
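For loading parquet files scattered across subdirectories, a minimal sketch (assuming Spark 3.0+ and a hypothetical path; the 10-core/10 GB worker matches the question's setup):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("recursive-parquet")
  .master("local[10]")
  .config("spark.executor.memory", "10g")
  .getOrCreate()

// recursiveFileLookup (Spark 3.0+) descends into all subdirectories;
// Spark splits files of varying size into partitions automatically.
val df = spark.read
  .option("recursiveFileLookup", "true")
  .parquet("/data/parquet/root") // hypothetical path

df.printSchema()
```

Without `recursiveFileLookup`, glob patterns such as `/data/parquet/root/*/*.parquet` are an older alternative for nested layouts.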
0 votes
0 answers
37 views
How to determine the best number of cores and memory for Spark job
How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
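One widely cited heuristic (an assumption here, not from the question itself) is about 5 cores per executor, with one core and roughly 1 GB per node reserved for the OS and Hadoop daemons. A back-of-envelope sizing sketch, with hypothetical cluster numbers:

```scala
// Hypothetical cluster specs; adjust for your environment.
val nodes = 4
val coresPerNode = 16
val memPerNodeGb = 64

val usableCores = coresPerNode - 1        // reserve 1 core for OS/daemons
val usableMemGb = memPerNodeGb - 1        // reserve ~1 GB for OS/daemons
val coresPerExecutor = 5                  // common HDFS-throughput heuristic
val executorsPerNode = usableCores / coresPerExecutor
val totalExecutors = nodes * executorsPerNode - 1  // minus 1 for the driver
val memPerExecutorGb =
  (usableMemGb / executorsPerNode * 0.9).toInt     // leave ~10% for overhead

println(s"--num-executors $totalExecutors " +
  s"--executor-cores $coresPerExecutor " +
  s"--executor-memory ${memPerExecutorGb}g")
```

Data volume and job frequency then mainly inform how many such jobs can run concurrently, not the per-executor shape.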
3 votes
0 answers
168 views
Clustering large set of images
I've got some big datasets of images (a few million each), and I would like to cluster them according to images' visual similarities. I've extracted a feature vector for each image; the space of ...
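With feature vectors already extracted, a distributed clustering pass in Spark MLlib is one option. A sketch, assuming a DataFrame `featuresDf` with a Vector column named `features` (both names are hypothetical):

```scala
import org.apache.spark.ml.clustering.KMeans

// Distributed k-means over millions of pre-extracted image feature vectors.
val kmeans = new KMeans()
  .setK(1000)                  // number of clusters; tune for your data
  .setFeaturesCol("features")
  .setSeed(42L)

val model = kmeans.fit(featuresDf)
val clustered = model.transform(featuresDf) // adds a "prediction" column
```

For high-dimensional image embeddings, reducing dimensionality first (e.g., with PCA) often makes the distance computations cheaper and the clusters tighter.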
4 votes
1 answer
1k views
What is the main difference between Hadoop and Spark? [closed]
I recently read the following about Hadoop vs. Spark: Insist upon in-memory columnar data querying. This was the killer-feature that let Apache Spark run in seconds the queries that would take Hadoop ...
2 votes
1 answer
316 views
Spark: How to run PCA parallelized? Only one thread used
I use pySpark and set my configuration as follows: ...
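A Scala sketch of the equivalent pipeline (the question uses pySpark; the `ml.feature.PCA` API is analogous), assuming a DataFrame `df` with a Vector column `features`:

```scala
import org.apache.spark.ml.feature.PCA

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(10)                    // number of principal components

// Note: the covariance/Gramian is computed in a distributed way, but the
// final eigendecomposition runs on the driver, which can appear as a
// single busy thread even on a multi-core cluster.
val model = pca.fit(df)
val reduced = model.transform(df)
```

So a single active thread during part of the fit is expected behavior for this implementation, not necessarily a misconfiguration.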
1 vote
0 answers
34 views
What is important for Pharmaceutical companies to answer with Big Data Analysis?
I am a data scientist with some background in biology (genetics). I have been asked to give a talk to our customers from the pharmaceutical industry. I should show them how they can benefit from Big ...
0 votes
1 answer
125 views
Creating more than one worker nodes for local windows machine [closed]
I am using a Windows laptop, on which I installed Apache Spark. I want to measure Spark performance by changing Spark components, so I would like to create more than one worker node ...
0 votes
1 answer
527 views
How to run Spark python code in Jupyter Notebook via command prompt
I am trying to import a data frame into Spark using Python's pyspark module. For this, I used Jupyter Notebook and executed the code shown in the screenshot below. After that, I want to run this in CMD ...
2 votes
0 answers
293 views
How to create tensors in spark?
I have the following data stored in HDFS: each row has three columns (id, date, item), meaning that a person with a particular id bought a particular item on a particular date. The dataset has billions ...
1 vote
0 answers
41 views
Is there a way to use a pom.xml file to update spark configuration?
I am trying to update my Spark configuration to solve some dependency problems. A pom.xml file seems to be useful for this purpose. I am using a Spark Docker image. ...
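For dependency conflicts, the usual Maven route is to declare the Spark artifacts explicitly in pom.xml. A hedged sketch (the version and Scala suffix below are assumptions; they must match the Spark build inside the Docker image):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.5.1</version>
  <!-- "provided": the image already ships Spark, so don't bundle it -->
  <scope>provided</scope>
</dependency>
```

Note that pom.xml controls which jars end up on the application classpath; runtime settings such as memory or shuffle behavior still belong in spark-defaults.conf or SparkSession config, not in the POM.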
1 vote
0 answers
513 views
Spark Scala concatenate 2 different data frames
I have two different Spark DataFrames and I want to concatenate them column-wise, with no join operations. How can I do it using Scala?
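Since DataFrames have no positional alignment, one common workaround is to attach a stable row index to each side and pair rows on it. A sketch, assuming both DataFrames have the same row count and distinct column names:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Concatenate two DataFrames column-wise by pairing rows on an index.
// zipWithIndex assigns contiguous, order-preserving indices, unlike
// monotonically_increasing_id, whose values are not contiguous.
def concatColumns(left: DataFrame, right: DataFrame): DataFrame = {
  val spark = left.sparkSession
  def withIndex(df: DataFrame): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(
      df.schema.fields :+ StructField("_row_idx", LongType, nullable = false))
    spark.createDataFrame(indexed, schema)
  }
  withIndex(left)
    .join(withIndex(right), Seq("_row_idx"))
    .drop("_row_idx")
}
```

The equi-join on the index is the one unavoidable shuffle; there is no built-in zero-join column concatenation for distributed DataFrames.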
3 votes
2 answers
836 views
Navigating the jungle of choices for scalable ML deployment
I have prototyped a machine learning (ML) model on my local machine and would like to scale it to both train and serve on much larger datasets than could be feasible on a single machine. (The model ...
2 votes
0 answers
2k views
Alternative to Apache Spark? [closed]
I have been looking for a comprehensive alternative to Apache Spark for Big Data analytics/machine learning and couldn't find one. The ones I have come across are: Apache Flink, Google DataFlow ...
2 votes
1 answer
73 views
Does storing a file in HDFS parallelize it for Spark?
For Spark's RDD operations, data must be in the form of an RDD or be parallelized using: ParallelizedData = sc.parallelize(data) My question is: if I store data in ...
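The short answer is that `parallelize` is only for collections that already live in the driver's memory; data read from HDFS arrives distributed. A sketch (the HDFS path is hypothetical):

```scala
// For a local, in-driver collection, parallelize distributes it:
val localRdd = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 4)

// Data in HDFS is already split into blocks across the cluster, and
// sc.textFile creates roughly one partition per HDFS block
// (128 MB by default), so no parallelize call is needed:
val hdfsRdd = sc.textFile("hdfs:///data/input.txt")
println(hdfsRdd.getNumPartitions)
```

So storing a file in HDFS does give Spark parallel input splits; `parallelize` would only come into play after collecting the data to the driver, which defeats the purpose.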